Crate decancer

A tiny package that removes common unicode confusables/homoglyphs from strings.

  • Its core is written in Rust and uses a form of binary search to keep lookups fast!
  • It has virtually no third-party dependencies - it only depends on itself.
  • It stores its huge collection of codepoints in an optimized 25.72 KB binary file instead of a huge JSON or text file to keep its bundle size small!
  • It’s capable of filtering 149,513 (13.42%) different unicode codepoints, including 9,628 different confusables.
  • And it’s available in the following languages: Rust, JavaScript (Node.js and Browser), and C/C++.
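
The combination of a compact binary table and binary search mentioned in the list above can be sketched roughly as follows. This is only an illustration, not decancer's actual file format or code: the 8-byte record layout and the names `RECORD_SIZE`, `build_table`, `read_u32`, and `lookup` are all hypothetical, and the three sample pairs are taken from the "very funny text" example further down.

```rust
/// Size of one record in this hypothetical packed table:
/// 4 bytes (little-endian) for the confusable codepoint,
/// followed by 4 bytes (little-endian) for its replacement.
const RECORD_SIZE: usize = 8;

/// Packs (confusable, replacement) pairs into a byte table.
/// The pairs are sorted by codepoint so the table can be binary searched.
fn build_table(pairs: &[(char, char)]) -> Vec<u8> {
    let mut sorted = pairs.to_vec();
    sorted.sort_by_key(|&(from, _)| from as u32);

    let mut table = Vec::with_capacity(sorted.len() * RECORD_SIZE);
    for (from, to) in sorted {
        table.extend_from_slice(&(from as u32).to_le_bytes());
        table.extend_from_slice(&(to as u32).to_le_bytes());
    }
    table
}

/// Reads the little-endian u32 stored at `offset`.
fn read_u32(table: &[u8], offset: usize) -> u32 {
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&table[offset..offset + 4]);
    u32::from_le_bytes(bytes)
}

/// Binary searches the packed table for `c` and returns its replacement,
/// or `None` when `c` is not listed as a confusable.
fn lookup(table: &[u8], c: char) -> Option<char> {
    let target = c as u32;
    let (mut low, mut high) = (0, table.len() / RECORD_SIZE);

    while low < high {
        let mid = low + (high - low) / 2;
        let key = read_u32(table, mid * RECORD_SIZE);

        if key == target {
            return char::from_u32(read_u32(table, mid * RECORD_SIZE + 4));
        } else if key < target {
            low = mid + 1;
        } else {
            high = mid;
        }
    }

    None
}

fn main() {
    // Sample pairs only; a real table would hold thousands of entries.
    let table = build_table(&[('ⓡ', 'r'), ('𝔂', 'y'), ('Ň', 'n')]);

    assert_eq!(lookup(&table, 'ⓡ'), Some('r'));
    assert_eq!(lookup(&table, '𝔂'), Some('y'));
    assert_eq!(lookup(&table, 'Ň'), Some('n'));
    assert_eq!(lookup(&table, 'a'), None);
}
```

Fixed-size records keep the table small and let the search jump straight to the middle record without parsing anything, which is presumably the kind of trade-off a packed binary file is meant to buy over JSON or plain text.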

Installation

Rust (v1.64 or later)

In your Cargo.toml:

decancer = "1.6.5"

JavaScript (Node.js)

In your shell:

$ npm install decancer

In your code (CommonJS):

const decancer = require('decancer')

In your code (ESM):

import decancer from 'decancer'

JavaScript (Browser)

In your code:

<script type="module">
  import init from 'https://cdn.jsdelivr.net/gh/null8626/decancer@v1.6.5/bindings/wasm/bin/decancer.min.js'

  const decancer = await init()
</script>

C/C++

Download

Building from source

Prerequisites: Git and a Rust toolchain (for cargo).

$ git clone https://github.com/null8626/decancer.git --depth 1
$ cd decancer/bindings/native
$ cargo build --release

The binary files should then be generated in the target/release directory.

Examples

note: cured output will always be in lowercase.

Rust

For more information, please read the documentation.

let cured = decancer::cure("vEⓡ𝔂 𝔽𝕌Ňℕy ţ乇𝕏𝓣");

assert_eq!(cured, "very funny text");
assert!(cured.contains("FuNny"));
assert_eq!(cured.into_str(), String::from("very funny text"));

assert_eq!(decancer::cure("v̵̨̟̩͕̭̼͍̜͊̎̽̅͊̍́̏̓̕ͅe̴̡͙̳̭͚͕͕̞̦̱͊͗̈̓̑̈́̀͘ͅr̵̡̢̫̞͕͎̱͇̠͕͎̺̱̭̪̈͜ͅy̴̧̯͈̥͔̣̫̮̦̪͎̮͑̄̏̂̽͘̚͘̚͜͜͠ ̸̨̛̬͈̲̗͕̜͚̟̈̔́̾͝f̷̪̺͓̽̃̽̀̀̓̽́̾͗̋̇̀̀͐u̴͕̜̗͛̈͆̐n̸̡͙̣̙̳̥͕̥̼̪̻̪̋̀̀̽̈́͜n̷̨̗͖̗̹̜͈̗̲͔͕͉̗̻͓̟̓̽̾͗͑̾̈͜ỹ̶̧̧̩̜̹̩̩̠̦͉̮̳̦̀͛͗̒͑̅̿͌͋͠ ̴̛̠͕̥͇͉̙̯͙̠͇̝̍̃̓̆̈́̐͊̈́͘͝͠t̴̨̰̜̟͓̬͊̂̽̃͌́͂̓̊̅̃̕̚ͅȩ̵̛̬͈͔̮͙͇̫̄̽͒̊́́̀͒̚x̸̖͖̜͍̣̹̺̟̬̞̝͇̐̇̽̒͋̒̑̃̒̄̐͘͝t̸̥̅̓̉̽͑̔̑̿̇"), "very funny text");
assert_eq!(decancer::cure("foo ㍴ ㎈ls console.㏒"), "foo bar calls console.log");
assert_eq!(decancer::cure("you 🆚 w3ird un1c0de ch4rs"), "you vs weird unicode chars");

JavaScript (Node.js)

const assert = require('node:assert')
const decancer = require('decancer')
const cured = decancer('vEⓡ𝔂 𝔽𝕌Ňℕy ţ乇𝕏𝓣')

assert(cured.equals('very funny text'))
assert(cured.contains('funny'))

console.log(cured.toString())
// => 'very funny text'

// compare with .equals(): assert(value, message) only checks that value is truthy
assert(decancer("v̵̨̟̩͕̭̼͍̜͊̎̽̅͊̍́̏̓̕ͅe̴̡͙̳̭͚͕͕̞̦̱͊͗̈̓̑̈́̀͘ͅr̵̡̢̫̞͕͎̱͇̠͕͎̺̱̭̪̈͜ͅy̴̧̯͈̥͔̣̫̮̦̪͎̮͑̄̏̂̽͘̚͘̚͜͜͠ ̸̨̛̬͈̲̗͕̜͚̟̈̔́̾͝f̷̪̺͓̽̃̽̀̀̓̽́̾͗̋̇̀̀͐u̴͕̜̗͛̈͆̐n̸̡͙̣̙̳̥͕̥̼̪̻̪̋̀̀̽̈́͜n̷̨̗͖̗̹̜͈̗̲͔͕͉̗̻͓̟̓̽̾͗͑̾̈͜ỹ̶̧̧̩̜̹̩̩̠̦͉̮̳̦̀͛͗̒͑̅̿͌͋͠ ̴̛̠͕̥͇͉̙̯͙̠͇̝̍̃̓̆̈́̐͊̈́͘͝͠t̴̨̰̜̟͓̬͊̂̽̃͌́͂̓̊̅̃̕̚ͅȩ̵̛̬͈͔̮͙͇̫̄̽͒̊́́̀͒̚x̸̖͖̜͍̣̹̺̟̬̞̝͇̐̇̽̒͋̒̑̃̒̄̐͘͝t̸̥̅̓̉̽͑̔̑̿̇").equals("very funny text"))
assert(decancer("foo ㍴ ㎈ls console.㏒").equals("foo bar calls console.log"))
assert(decancer("you 🆚 w3ird un1c0de ch4rs").equals("you vs weird unicode chars"))

JavaScript (Browser)

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>Decancerer!!! (tm)</title>
    <style>
      textarea {
        font-size: 30px;
      }

      #cure {
        font-size: 20px;
        padding: 5px 30px;
      }
    </style>
  </head>
  <body>
    <h3>Input cancerous text here:</h3>
    <textarea rows="10" cols="30"></textarea>
    <br />
    <button id="cure" onclick="cure()">cure!</button>
    <script type="module">
      import init from 'https://cdn.jsdelivr.net/gh/null8626/decancer@v1.6.5/bindings/wasm/bin/decancer.min.js'

      const decancer = await init()

      window.cure = function () {
        const textarea = document.querySelector('textarea')

        if (!textarea.value.length) {
          return alert("There's no text!!!")
        }

        textarea.value = decancer(textarea.value).toString()
      }
    </script>
  </body>
</html>

See this in action here.

C/C++

#include <decancer.h>

// bool and the fixed-width integer types used below
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

// global variable for assertion purposes only
decancer_cured_t cured;

static void assert(const bool expr, const char *message)
{
    if (!expr)
    {
        fprintf(stderr, "assertion failed (%s)\n", message);
        decancer_free(cured);
        
        exit(1);
    }
}

int main(void) {
    // utf-8 bytes for "vEⓡ𝔂 𝔽𝕌Ňℕy ţ乇𝕏𝓣"
    uint8_t string[] = {0x76, 0xef, 0xbc, 0xa5, 0xe2, 0x93, 0xa1, 0xf0, 0x9d, 0x94, 0x82, 0x20, 0xf0, 0x9d,
                        0x94, 0xbd, 0xf0, 0x9d, 0x95, 0x8c, 0xc5, 0x87, 0xe2, 0x84, 0x95, 0xef, 0xbd, 0x99,
                        0x20, 0xc5, 0xa3, 0xe4, 0xb9, 0x87, 0xf0, 0x9d, 0x95, 0x8f, 0xf0, 0x9d, 0x93, 0xa3};

    cured = decancer_cure(string, sizeof(string));

    assert(decancer_equals(cured, (uint8_t *)("very funny text"), 15), "equals");
    assert(decancer_contains(cured, (uint8_t *)("funny"), 5), "contains");

    // coerce output as a raw UTF-8 pointer and retrieve its size (in bytes)
    size_t output_size;
    const uint8_t *output_raw = decancer_raw(cured, &output_size);

    assert(output_size == 15, "raw output size");

    // utf-8 bytes for "very funny text"
    const uint8_t expected_raw[] = {0x76, 0x65, 0x72, 0x79, 0x20, 0x66, 0x75, 0x6e,
                                    0x6e, 0x79, 0x20, 0x74, 0x65, 0x78, 0x74};

    // buffer with headroom; snprintf never writes past its end
    char assert_message[64];
    for (uint32_t i = 0; i < sizeof(expected_raw); i++)
    {
        snprintf(assert_message, sizeof(assert_message), "mismatched utf-8 contents at index %u", (unsigned)i);
        assert(output_raw[i] == expected_raw[i], assert_message);
    }

    decancer_free(cured);    
    return 0;
}

Contributing

New contributors, please read CONTRIBUTING.md before contributing!

Structs

Enums

  • The translation for a single character/codepoint.

Functions

  • Cures a string. Output will always be in lowercase and all overridden comparison methods provided by CuredString are case-insensitive.
  • Cures a single character/unicode codepoint. Output will always be in lowercase and equality methods provided by Translation are case-insensitive.
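
To make the "lowercase output, case-insensitive comparisons" behavior described above more concrete, here is a minimal sketch using only the standard library. The `Cured` type and its methods are hypothetical stand-ins invented for this example; they are not the crate's CuredString or Translation, and the sketch only lowercases its input rather than translating confusables.

```rust
/// Hypothetical stand-in for a cured string: it stores lowercase text
/// and compares against other strings without regard to case.
#[derive(Debug)]
struct Cured(String);

impl Cured {
    /// Lowercases the input. A real cure would also translate confusables,
    /// which this sketch deliberately leaves out.
    fn new(input: &str) -> Self {
        Cured(input.to_lowercase())
    }

    /// Case-insensitive substring check: the needle is lowercased
    /// before being searched for in the stored lowercase text.
    fn contains(&self, needle: &str) -> bool {
        self.0.contains(&needle.to_lowercase())
    }
}

/// Case-insensitive equality against string slices.
impl PartialEq<&str> for Cured {
    fn eq(&self, other: &&str) -> bool {
        self.0 == other.to_lowercase()
    }
}

fn main() {
    let cured = Cured::new("Very FUNNY Text");

    // The stored output is lowercase...
    assert_eq!(cured.0, "very funny text");

    // ...while comparisons accept any casing.
    assert!(cured == "VERY funny TEXT");
    assert!(cured.contains("FuNny"));
    assert!(!cured.contains("serious"));
}
```

Normalizing once at construction and lowercasing only the other operand at comparison time is one simple way to get this behavior; the crate itself may implement it differently.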